Abstract: In this paper we study master-worker scheduling of divisible loads in heterogeneous
distributed systems. Divisible loads are computations that can be arbitrarily divided into independent
“chunks”, which can then be processed in parallel. In multi-round scheduling load is sent to each worker 
as several chunks rather than as a single one. Solving the divisible load scheduling (DLS) problem entails 
determining the subset of workers that should be used, the sequence of communication to these workers, 
and the sizes of each load chunk. We first state and establish an optimality principle in the general case. 
Then we establish a new complexity result by showing that a DLS problem, whose complexity has been 
open for a long time, is in fact NP-hard, even in the one-round case. We also show that this problem 
is pseudopolynomially solvable under certain special conditions. Finally, we present a deep survey on 
algorithms and heuristics for solving the multi-round DLS problem. 
On the complexity of multi-round scheduling of divisible loads


Summary: In this article, we study master-worker scheduling of divisible loads.
A divisible load is a computation that can be arbitrarily split into independent sub-tasks, which
can therefore be processed in parallel. In a multi-round schedule, each worker
receives its sub-tasks in several installments rather than all at once, which allows better overlap
of communication with computation. A multi-round schedule is characterized by the
subset of workers used, the order in which communications to these workers take place,
and the size of each sub-task. We establish original complexity results for this
problem. We state and prove an optimality principle for the general case. We show
NP-completeness even in the one-round case. We also propose a
pseudo-polynomial algorithm for certain situations. We show that, in its full generality, it is difficult
to prove that this problem is in NP, and we survey the various techniques
(exact, guaranteed, or non-guaranteed) for solving it.

Keywords: divisible loads, multi-round, linear programming, complexity, optimality principle



1 Introduction 


The problem of assigning the tasks of a parallel application to distributed computing resources in time 
and space with a view to optimizing some metric of performance is termed the “scheduling problem”. It has
been studied for a variety of application models. Popular models include the directed acyclic task graph 
model [24], and the simpler independent task model in which there is no precedence or communication 
among computational tasks [13]. These models are representative of many applications in science and 
engineering. Typically the number of tasks, their communication and computation costs, are set in 
advance. The scheduling problem is known to be difficult [22]. For instance, in the independent task 
model, the scheduling problem is akin to bin-packing and, as a result, many heuristics have been proposed 
in the literature (see [15] for a survey). Another flavor of the independent tasks model is the one in which 
the number of tasks and the task sizes can be chosen arbitrarily. In this case, the application consists 
of an amount of computation, or load, that can be arbitrarily divided into any number of independent 
pieces, or chunks. In practice, this model is an approximation of an application that consists of a large 
number of identical, low-granularity units of load. This divisible load (DL) model arises in practice in 
many domains [18, 11, 25, 27, 14, 6, 21, 30] and has been widely studied in the last decade [9, 10, 29]. 
This paper focuses on the Divisible Load Scheduling (DLS) problem. 

We consider distributed computing platforms in which the compute nodes are interconnected in a 
logical star topology (i.e., a single-level tree), which is a popular and realistic model for deploying DL 
applications in practice. The application runs in master-worker fashion, i.e., the root of the star (the 
master) initially holds all application input data and dispatches work to the leaves of the star (the 
workers). We make the common assumptions [9] that the master sends data to only one worker at a 
time (i.e., the “one-port” model), and that workers can compute and communicate simultaneously (i.e., 
the “with front-end” model). We focus on heterogeneous platforms, meaning that the communication 
and computation rates of different workers can be different. Computing a DL schedule entails three 
steps: (i) select which workers should participate in the computation; (ii) decide in which order workers 
should receive load chunks and how many times; and (iii) compute how much work each load chunk 
should comprise. Previously proposed solutions to the DLS problem fall into two categories: one-round 
schedules and multi-round schedules. In one-round schedules, each worker receives only one load chunk. 
In multi-round schedules, each worker may receive multiple load chunks throughout application execution. 

Multi-round schedules have been shown to be preferable to one-round schedules because they allow 
for better overlap of computation and communication [2]. Unfortunately, designing multi-round DLS 
algorithms is more challenging and fewer results are available in the literature. One key difficulty with 
multi-round scheduling is due to the presence of start-up costs, that is, fixed amounts of time that must
be spent when sending data over a network. While one could model the time to send
some data over a network as linear in the data size, a better model is to view communication
delay as affine in the data size, with a constant portion that corresponds to the overhead of initiating
a network connection and to the physical network latency. Modeling this start-up cost is important to
be relevant to practice, especially as computing platforms that span wide-area networks have emerged 
and are prime candidates for loosely-coupled applications such as DL applications [20]. Furthermore, 
modeling data transfer costs as linear in data size leads to schedules that divide the load into an infinite
number of chunks that each need to be sent to workers. Such a schedule would lead to infinite overhead 
in practice. With start-up costs, the scheduling algorithm must pick an optimal finite number of chunks, 
which is difficult. Furthermore, without start-up costs, all workers can be utilized as there is no penalty 
for using even a slow worker. With start-up costs however, the scheduling algorithm must pick workers 
carefully. Thus, while modeling start-up costs is more relevant to practice, it makes DLS significantly 
more challenging. 

In this paper we focus on the multi-round DLS problem on heterogeneous star networks with
communication start-up costs and we make the following contributions:

1. We give the first proof, to the best of our knowledge, of the NP-hardness of the multi-round DLS 
problem. This result is obtained by first proving NP-completeness for the one-round case, which is 
a novel result as well. 

2. We propose a pseudo-polynomial algorithm to solve the multi-round DLS problem in a particular 
case. 

3. We give an algorithm that computes the optimal solution to the multi-round DLS problem (in
exponential time). This algorithm relies on a Mixed Integer Linear Programming (MILP) formulation.

4. We conduct an experimental evaluation, using simulation, of previously proposed heuristics. In
particular, for each heuristic we quantify the trade-off between the time to compute the schedule 
and the quality of the schedule. To the best of our knowledge, this is the most complete such 
evaluation to date, both in terms of the heuristics and of the range of experimental scenarios. 

This paper is organized as follows. We give the NP-completeness proof and the pseudo-polynomial 
algorithm in Section 4. Section 5 describes the MILP algorithm, while Section 6 highlights previously 
proposed heuristics. Section 7 summarizes our results and discusses future directions. 


2 Problem Definition and Notations 

Consider a DL application that consists of W independent units of load to be processed. The processing 
of each load unit involves performing some computations on some input data. Initially, input data are 
located on a master computer. Without loss of generality, we assume that the master does not perform 
any computation. The master can send input data for one or more load units to p workers. Worker i can 
process a chunk of x load units in xAi seconds, and the master can send a chunk of x load units to worker 
i in Si + xCi seconds. We assume that the Ai’s, Si’s, and Ci’s are integer, while x and W are rational. We 
assume that the master cannot send chunks to more than one worker at a time, following the one-port 
model. We also assume that a worker may compute and receive data simultaneously. However a worker 
has to wait for a chunk to be completely transferred before starting to process it. Both these assumptions
are commonly used in the DLS literature. We do not consider transfer of output data back to the master. 
This is also a common assumption in the literature (interested readers can find a discussion of output 
data in [9, 18, 31]). 

The problem we consider in this paper is: how should the master partition the load into chunks and 
send those chunks to the workers so that the application makespan, i.e., the time at which the last unit of 
load is completed, is minimized? A schedule consists of a sequence of workers to which the master sends 
load chunks in order, which is called the activation sequence, and the size of each load chunk. In the rest 
of the paper we denote by the size of the chunk of load sent to worker i, measured as a rational 
number of load units, denotes the size of the chunk of load sent to worker i in case only one chunk 
is sent to worker i in the schedule. Some workers may not be used in the schedule and do not appear 
in the activation sequence. In the following, we denote by $act_{max}$ the largest number of activations
allowed in an activation sequence. In the one-round case, a worker can only appear once in the activation
sequence. The typical notion of multi-round used in the literature assumes that the activation sequence
is periodic. Hence, if we denote by $r_{max}$ the largest number of rounds allowed in a multi-round schedule
and by $r_{size}$ the number of processors used in each round, we have $act_{max} = r_{max} \cdot r_{size}$. In this paper we
impose no periodicity constraints on the activation sequence and instead consider the general case (e.g., 
some workers may appear twice as often as other workers in the activation sequence). 
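
The model above can be made concrete with a small simulation. The following is a minimal sketch, assuming 0-based worker indices and exact rational arithmetic; the function name `makespan` and its argument layout are illustrative choices, not notation from the paper. It applies the one-port rule (the master sends chunks one after the other) and the front-end rule (a worker may receive while computing, but starts a chunk only once it is fully received).

```python
from fractions import Fraction

def makespan(sigma, alpha, S, C, A):
    """Finish time of the last load unit for a given schedule.

    sigma -- activation sequence, as 0-based worker indices
    alpha -- chunk sizes, alpha[j] being sent at the j-th activation
    S, C, A -- per-worker start-up cost, per-unit transfer time,
               and per-unit computation time
    """
    master_free = Fraction(0)   # one-port: the master sends one chunk at a time
    worker_free = {}            # time at which each worker finishes its last chunk
    finish = Fraction(0)
    for worker, size in zip(sigma, alpha):
        # The transfer costs a start-up time plus a time proportional to the size.
        master_free += S[worker] + size * C[worker]
        if size > 0:
            # A chunk starts once it is received and the previous one is done.
            start = max(master_free, worker_free.get(worker, Fraction(0)))
            worker_free[worker] = start + size * A[worker]
            finish = max(finish, worker_free[worker])
    return finish
```

For instance, with two workers such that (S, C, A) is (1, 10, 1) for the first and (2, 1, 1) for the second, sending 23/12 load units to the second worker and then 1/12 to the first makes both workers finish at time 70/12.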

We define the associated decision problem as: 

Problem 1 (DLS). Given $p$ workers, $(A_i)_{1 \le i \le p}$, $(S_i)_{1 \le i \le p}$, $(C_i)_{1 \le i \le p}$, and two rational numbers $W \ge 0$
and $T \ge 0$, is it possible to compute all $W$ load units within $T$ seconds after the master starts sending
out the first load unit?

In the following, we may consider some restrictions of the DLS problem. These restrictions will be 
denoted as DLS{restriction}, where restriction may be for example: 1Round (all processors are used at
most one time), $C_i = 0$ (bandwidths from the master to the workers are infinite), $S_i = 0$ (no latency), and
so on.

Similarly, we define the two optimization problems: 

Problem 2 (DLS-OptT). Given a rational workload $W$, $p$ workers, $(A_i)_{1 \le i \le p}$, $(S_i)_{1 \le i \le p}$, $(C_i)_{1 \le i \le p}$,
what is the smallest rational number $T \ge 0$ such that it is possible to compute all $W$ load units within $T$
seconds after the master starts sending out the first load unit?

Problem 3 (DLS-OptW). Given a rational time bound $T$, $p$ workers, $(A_i)_{1 \le i \le p}$, $(S_i)_{1 \le i \le p}$, $(C_i)_{1 \le i \le p}$,
what is the largest rational number $W \ge 0$ such that it is possible to compute all $W$ load units within $T$
seconds after the master starts sending out the first load unit?

3 Optimality Principle

In this section, we discuss the Optimality Principle first proposed by Robertazzi [9]. We start by recalling
how DLS{FixedActivation}, DLS{FixedActivation}-OptT, and DLS{FixedActivation}-OptW can be solved
in polynomial time, and then we discuss the precise formulation of the Optimality Principle. Last, we
prove this optimality principle.


3.1 DLS{FixedActivation} and Linear Programming

Consider a given instance $I = (S, C, A)$ of the problem. Let $\sigma : \{1,\ldots,n\} \to \{1,\ldots,p\}$ denote a
given activation sequence of size $n$. Then, if we denote by $\alpha_j$ the amount of workload sent to $P_{\sigma(j)}$,
DLS{FixedActivation} is equivalent to determining whether the following linear constraints define a
non-empty set:

\[
\begin{cases}
\text{(1a)} & \sum_{j=1}^{n} \alpha_j = W \\
\text{(1b)} & \forall k \le n:\ \sum_{j=1}^{k} \left(S_{\sigma(j)} + \alpha_j C_{\sigma(j)}\right) + \sum_{j \ge k \,:\, \sigma(j) = \sigma(k)} \alpha_j A_{\sigma(k)} \le T \\
\text{(1c)} & \forall j \le n:\ \alpha_j \ge 0
\end{cases}
\qquad (1)
\]

The leftmost part of Constraint (1b) represents the time at which the communication ends, and
the middle one is a lower bound on the computation time of worker $\sigma(k)$ after this communication. The
sum of these two times thus has to be no larger than the makespan $T$. Considering in backward order the
activations in which a given worker is used, it is not hard to see from the constraints that one obtains
a feasible schedule [16].
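
Checking a candidate load distribution against these constraints is mechanical. Below is a minimal sketch, assuming that Constraint (1a) requires the chunk sizes to sum exactly to W and that workers are indexed from 0; the function name `satisfies_constraints` is an illustrative choice.

```python
from fractions import Fraction

def satisfies_constraints(sigma, alpha, S, C, A, W, T):
    """Check a load distribution alpha against constraints (1a)-(1c).

    sigma is the activation sequence (0-based worker indices) and
    S, C, A are the per-worker start-up, transfer, and computation costs.
    """
    n = len(sigma)
    if any(a < 0 for a in alpha):        # (1c): chunk sizes are non-negative
        return False
    if sum(alpha) != W:                  # (1a): the whole load is distributed
        return False
    comm_end = Fraction(0)
    for k in range(n):
        worker = sigma[k]
        # Leftmost part of (1b): end of the k-th communication.
        comm_end += S[worker] + alpha[k] * C[worker]
        # Middle part of (1b): load this worker still has to process afterwards.
        pending = sum(alpha[j] for j in range(k, n) if sigma[j] == worker)
        if comm_end + pending * A[worker] > T:
            return False
    return True
```

Since the answer depends only on sums and comparisons, exact rationals (`Fraction`) avoid any rounding issue when a constraint is tight.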

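Likewise, DLS{FixedActivation}-OptT is equivalent to the following linear program:

\[
\text{Minimize } T, \text{ under the constraints }
\begin{cases}
\text{(2a)} & \sum_{j=1}^{n} \alpha_j = W \\
\text{(2b)} & \forall k \le n:\ \sum_{j=1}^{k} \left(S_{\sigma(j)} + \alpha_j C_{\sigma(j)}\right) + \sum_{j \ge k \,:\, \sigma(j) = \sigma(k)} \alpha_j A_{\sigma(k)} \le T \\
\text{(2c)} & \forall j \le n:\ \alpha_j \ge 0
\end{cases}
\qquad (2)
\]

and DLS{FixedActivation}-OptW is equivalent to the following linear program:

\[
\text{Maximize } W = \sum_{j=1}^{n} \alpha_j, \text{ under the constraints }
\begin{cases}
\text{(3a)} & \forall k \le n:\ \sum_{j=1}^{k} \left(S_{\sigma(j)} + \alpha_j C_{\sigma(j)}\right) + \sum_{j \ge k \,:\, \sigma(j) = \sigma(k)} \alpha_j A_{\sigma(k)} \le T \\
\text{(3b)} & \forall j \le n:\ \alpha_j \ge 0
\end{cases}
\qquad (3)
\]
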


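We can define the two functions:

\[
W_\sigma : \left[\sum_{i=1}^{n} S_{\sigma(i)}, \infty\right[ \to [0, \infty[, \qquad T \mapsto \sup\{W \mid \mathrm{DLS}(I, \sigma, W, T) \text{ has a solution}\}
\]

\[
T_\sigma : [0, \infty[ \to \left[\sum_{i=1}^{n} S_{\sigma(i)}, \infty\right[, \qquad W \mapsto \inf\{T \mid \mathrm{DLS}(I, \sigma, W, T) \text{ has a solution}\}
\]

Theorem 1. $W_\sigma$ and $T_\sigma$ are continuous piecewise-linear functions and are inverse of each other.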

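Proof. Linear program (3) can be written:

\[
\text{Maximize } \sum_{j=1}^{n} \alpha_j, \text{ under the constraints }
\begin{cases}
\forall k \le n:\ (B_\sigma \alpha)_k \le T - c_k \\
\forall j \le n:\ \alpha_j \ge 0
\end{cases}
\qquad (4)
\]

where $B_\sigma$ is the matrix of the coefficients of the $\alpha_j$'s in (3a) and $c_k = \sum_{j=1}^{k} S_{\sigma(j)}$. This program
has the same optimal value as its dual linear program

\[
\text{Minimize } \sum_{k=1}^{n} (T - c_k) y_k, \text{ under the constraints }
\begin{cases}
\forall k \le n:\ (B_\sigma^T y)_k \ge 1 \\
\forall k \le n:\ y_k \ge 0
\end{cases}
\]

Let us denote by $V_\sigma = \{y \mid \forall k \le n : (B_\sigma^T y)_k \ge 1 \text{ and } y_k \ge 0\}$. $V_\sigma$ is a convex polyhedron and we
know that optimal solutions to linear programs are found on the vertices of their polyhedron. Let us
denote by $SV_\sigma$ the set of vertices of $V_\sigma$. Therefore we have:

\[
W_\sigma(T) = \min\left\{ \sum_{k=1}^{n} (T - c_k) y_k \;\middle|\; y \in SV_\sigma \right\}
\]

Note that $SV_\sigma$ is finite and does not depend on $T$. As for each $y \in SV_\sigma$, $T \mapsto \sum_{k=1}^{n} (T - c_k) y_k$ is an affine
function, $W_\sigma$ is the minimum of finitely many affine functions, hence a continuous piecewise-linear function. As
$W_\sigma$ is strictly increasing, $W_\sigma$ is a bijection and $T_\sigma$ is its inverse function. $T_\sigma$ is hence also a continuous
piecewise-linear function. □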
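
As a small worked illustration of Theorem 1, consider a two-worker instance with $(S_1, C_1, A_1) = (1, 10, 1)$ and $(S_2, C_2, A_2) = (2, 1, 1)$ and the activation sequence $\sigma = (2, 1)$. Reading the instance of Figure 1 this way is an assumption, although it reproduces the value $W_{opt} = 2$ reported there for $T = 70/12$. The derivation below is a sketch obtained by enumerating the vertices of the dual polyhedron by hand.

```latex
% Primal constraints (3a) for sigma = (2, 1):
%   2 + 2\alpha_1 \le T, \qquad 3 + \alpha_1 + 11\alpha_2 \le T.
% Dual feasible set: 2 y_1 + y_2 \ge 1, \; 11 y_2 \ge 1, \; y_1, y_2 \ge 0,
% whose vertices are (y_1, y_2) = (5/11, 1/11) and (0, 1).
\[
W_\sigma(T) = \min\left\{ \tfrac{5}{11}(T - 2) + \tfrac{1}{11}(T - 3),\; T - 3 \right\}
= \begin{cases}
T - 3 & \text{if } 3 \le T \le 4, \\[4pt]
\dfrac{6T - 13}{11} & \text{if } T \ge 4,
\end{cases}
\]
\[
T_\sigma(W) = \begin{cases}
W + 3 & \text{if } 0 \le W \le 1, \\[4pt]
\dfrac{11 W + 13}{6} & \text{if } W \ge 1.
\end{cases}
\]
```

Both functions are continuous at the breakpoint ($T = 4$, $W = 1$), and $W_\sigma(70/12) = 2$.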

Corollary 2. $W_{opt}$ and $T_{opt}$ are continuous piecewise-linear functions and are inverse of each other.

3.2 Stating the Optimality Principle 

Linear programs such as (2) and (3) will therefore be used in many heuristics that only come up with an activation
sequence (e.g., the heuristics of Section 6.1). This approach differs from the solution of [9], where it is
assumed that optimal sequences have no idle times, i.e., that selected processors start working as soon as
they receive their first chunk and keep working until the end of the schedule. In other words, they compute
all the time and all stop computing at the same time. This assumption turns the linear inequalities (1b)
into linear equations. Thus, the formulations (2) and (3) reduce to a system of linear equations, which can
be solved in $O(n)$ time due to the particular structure of the system. The fact that the optimal sequence
has no idle time is known in the literature under the name of “Optimality Principle” [9]. However, for an
arbitrary activation sequence, this assumption does not hold: there may be idle times (see Figure 1).
Figure 1(b) shows that there exists an activation sequence such that the optimal load distribution for
DLS-OptT or DLS-OptW has idle time. However, one may argue that in this example processor 1 does
not work at all, and the optimality principle could then be reformulated as:

Proposition 3. For a given activation sequence, in an optimal load distribution (both for 
DLS{FixedActivation}-OptW and for DLS{FixedActivation}-OptT), either a processor has no idle time 
or it does not receive any load. 

Unfortunately, this formulation does not hold either, as shown by Figure 1(c). For an arbitrary
activation sequence, a processor may receive some load and have idle time. Thus there is no hope of
proving that an optimality principle holds for an arbitrary sequence. However, we can check that on this
specific instance, the optimality principle holds for the optimal activation sequence (see Figure 1(d)). In
the next section, we will prove the optimality principle for optimal activation sequences.

Related results have been proved in the past. For example, it has been proved for the one-round problem
in [2] that in the optimal selection, all selected workers finish computing at the same time. For
multi-round schedules on identical processors it has been shown [28] that there are idle times neither in the
communications nor in the computations.
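
Under the no-idle-time assumption discussed in this section, the chunk sizes for a fixed activation sequence are the solution of a square linear system. The sketch below solves that system with generic Gauss-Jordan elimination over exact rationals; it does not exploit the particular structure that allows a linear-time solution. The function name and the 0-based indexing are illustrative assumptions. A negative entry in the result signals that the no-idle-time assumption fails for that sequence.

```python
from fractions import Fraction

def no_idle_distribution(sigma, S, C, A, T):
    """Solve the system obtained by turning every constraint (3a) into an equality.

    Raises StopIteration if the system is singular.
    """
    n = len(sigma)
    rows = []
    startup = Fraction(0)
    for k in range(n):
        startup += S[sigma[k]]
        row = [Fraction(0)] * n
        for j in range(n):
            if j <= k:                            # transfers up to activation k
                row[j] += C[sigma[j]]
            if j >= k and sigma[j] == sigma[k]:   # remaining computation on that worker
                row[j] += A[sigma[k]]
        rows.append(row + [Fraction(T) - startup])
    for col in range(n):
        # Pick a row with a non-zero pivot, normalize it, and eliminate the column.
        pivot = next(r for r in range(col, n) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        rows[col] = [v / rows[col][col] for v in rows[col]]
        for r in range(n):
            if r != col and rows[r][col] != 0:
                factor = rows[r][col]
                rows[r] = [v - factor * p for v, p in zip(rows[r], rows[col])]
    return [rows[k][n] for k in range(n)]
```

With workers (S, C, A) = (1, 10, 1) and (2, 1, 1) and T = 70/12, serving the second worker first yields the chunks 23/12 and 1/12, whereas serving the first worker first yields a negative second chunk, so no idle-free distribution exists for that order.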

3.3 Proving the Optimality Principle 

In this section, we state and prove the optimality principle for optimal activation sequences and
DLS-OptW using ideas similar to those of [2].

Theorem 4 (Optimality Principle). For an optimal activation sequence and the corresponding optimal
load distribution in DLS-OptW and DLS-OptT, all messages, except maybe the trailing ones, convey
some load and there is no idle time.

Figure 1: DLS{FixedActivation}-OptW for $(S_1, C_1, A_1) = (1, 10, 1)$ and $(S_2, C_2, A_2) = (2, 1, 1)$.
(a) Optimal load distribution for $\sigma = (2, 1)$ and $T = 70/12$: $W_{opt} = 2$ and all processors stop working at the same time.
(b) Optimal load distribution for $\sigma = (1, 2)$ and $T = 70/12$: $W_{opt} = 17/12$. Processor 1 does not work at all and thus finishes before time $T$.
(c) Optimal load distribution for $\sigma = (2, 1, 2, 1, 2)$ and $T = 19$: $W_{opt} = 10$. Processor 2 receives some load and has idle time.
(d) Optimal activation sequence ($\sigma = (2, 1, 2, 1, 2)$) and optimal load distribution for $T = 19$. All processors receive some load and there is no idle time.

Figure 2: Closing the gap after an empty message.
(b) Decreasing initial chunks so that the communication gap disappears.



Proof. Let us consider an optimal activation sequence $\sigma$ for some instance $I$ of DLS-OptW. Note that
if there exists a processor $P_i$ such that $S_i < T - \sum_{j=1}^{n} \left(S_{\sigma(j)} + \alpha_j C_{\sigma(j)}\right)$, then it is possible to append a
message to $P_i$ with no load to the original communication sequence $\sigma$ without increasing the schedule length
$T$ or violating the original schedule. Such an activation sequence is not minimal. In the further discussion
we assume that $\sigma$ has no trailing empty messages. We assume that $\sigma$ is not empty and is feasible for
the given $T$, i.e., $\sum_{j=1}^{n} S_{\sigma(j)} < T$.

For a given $\sigma$, DLS{FixedActivation}-OptW is equivalent to linear program (3). Let us call $\alpha$ the
corresponding load distribution. It is known that the optimal solutions of a linear program are either on
vertices of the polyhedron defined by the linear constraints, or on a whole facet of the polyhedron.

Assume that the optimal solution of the linear program (3) is at a vertex of the polyhedron.
(3) has $n$ variables and $2n$ constraints. Therefore, at least $n$ constraints amongst the $2n$ are equalities.

• If none of the constraints (3b) is an equality in the optimal solution, then all the constraints (3a)
are equalities, which means that there is no idle time and all messages convey some load.

• If $l > 0$ constraints (3b) are equalities, then $n - l$ constraints (3a) are equalities and $l$ messages carry
no load. The $n - l$ remaining non-zero $\alpha_j$'s satisfy $n - l$ constraints (3a) which are equalities. Hence,
there are idle times neither in the communications to nor in the computations of the processors
receiving any load.

The $l$ messages with no load contribute only some start-up times to the communications. We will
show that by removing the start-up times of the $l$ messages with no load from the schedule, $W$
can be increased without increasing the schedule length $T$, which contradicts the assumption that
sequence $\sigma$ and its distribution are optimal for $T$.

Consider an empty message $k$ directed to some processor $P_a$. Suppose there is some other processor
$P_b$ with two nonempty messages which enclose the empty message to $P_a$ (cf. Figure 2(a)). Thus,
there are messages $j < k < l$ such that $\sigma(j) = \sigma(l) = b$, $\alpha_j > 0$, and $\alpha_l > 0$. Let $x_1$ be the duration
of the messages which follow message $j$ and precede message $k$. Let $x_2$ be the duration of the
messages which follow $k$ and precede $l$, and let $x = x_1 + x_2$. Let $H$ be the length of the interval
from the beginning of communication $j$ to the end of computation $l$. Since there are no idle
times in the communications and computations on the processors which receive non-zero load, we
can observe that:

\[
\alpha_j A_b = x + S_a + S_b + \alpha_l C_b
\]
\[
H = S_b + \alpha_j C_b + x + S_a + S_b + \alpha_l C_b + \alpha_l A_b
\]

From which we obtain:

\[
\alpha_j = \frac{H C_b + S_b A_b + (x + S_a) A_b - S_b C_b}{A_b^2 + A_b C_b + C_b^2},
\qquad
\alpha_l = \frac{H A_b - 2 S_b A_b - (x + S_a)(A_b + C_b) - S_b C_b}{A_b^2 + A_b C_b + C_b^2}
\]

and from the above:

\[
\alpha_j + \alpha_l = \frac{H (A_b + C_b) - S_b A_b - (x + S_a) C_b - 2 S_b C_b}{A_b^2 + A_b C_b + C_b^2}
\]

The interval of start-up time can be closed by increasing the size of message $l$. Thus, a new schedule
can be constructed such that there are no idle times in the communications and computations on
$P_b$ (cf. Figure 2(b)). Analogously to the previous reasoning, and denoting by $\alpha'_j$ and $\alpha'_l$ the new chunk sizes, we have:

\[
\alpha'_j A_b = x + S_b + \alpha'_l C_b
\]
\[
H = S_b + \alpha'_j C_b + x + S_b + \alpha'_l C_b + \alpha'_l A_b
\]

From which we obtain:

\[
\alpha'_j = \frac{H C_b + S_b A_b + x A_b - S_b C_b}{A_b^2 + A_b C_b + C_b^2},
\qquad
\alpha'_l = \frac{H A_b - 2 S_b A_b - x (A_b + C_b) - S_b C_b}{A_b^2 + A_b C_b + C_b^2}
\]
\[
\alpha'_j + \alpha'_l = \frac{H (A_b + C_b) - S_b A_b - x C_b - 2 S_b C_b}{A_b^2 + A_b C_b + C_b^2}
\]

The amount of processed load increases by $\alpha'_j + \alpha'_l - \alpha_j - \alpha_l = \frac{S_a C_b}{A_b^2 + A_b C_b + C_b^2} > 0$. In the new
schedule, communication $j$ with size $\alpha'_j$ finishes earlier by $\frac{S_a A_b C_b}{A_b^2 + A_b C_b + C_b^2} > 0$ units of time. Analogously,
message $l$ in the new schedule with size $\alpha'_l$ finishes $\frac{S_a A_b (A_b + C_b)}{A_b^2 + A_b C_b + C_b^2} > 0$ units of time earlier
than in the old schedule. Hence, the new schedule does not delay the initiation of the computation
on any other processor, because the new messages $j$ and $l$ finish earlier than in the initial
schedule. Thus, the schedule length does not increase, but the amount of processed load increases, which
contradicts the assumption that $\sigma$ is optimal and $W$ is maximum for the given $T$.

Assume that the empty message $k$ is not enclosed by two nonempty messages to any processor. Since
$\sigma$ is nonempty, message $k$ is either preceded or succeeded by nonempty message(s). Suppose
message $k$ is only followed by nonempty messages. By shifting the whole schedule by $S_{\sigma(k)}$ units of
time earlier, we still get a valid solution for DLS($W$, $T - S_{\sigma(k)}$). As $W_{opt}$ is strictly increasing, it
is possible to perform strictly more work than $W$ in time $T$, which contradicts the assumption
that $\sigma$ is optimal. Last, suppose there are no nonempty messages after $k$. This contradicts the
assumption that the trailing messages are nonempty.

Consequently, the optimal sequence and the corresponding optimal load distribution have no
empty messages, and have idle times neither in communications nor in computations.

Thus, we know that the only optimal vertices are the ones such that all the constraints (3a) are equalities.
As there is only one such vertex, the facet case cannot happen.

We have just proved that the optimality principle holds true for DLS-OptW. As DLS-OptW and
DLS-OptT are inverse of each other, the optimality principle also holds true for DLS-OptT. □
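
The closed forms used in the gap-closing step of the proof can be checked numerically. The following is a minimal sketch, assuming the two no-idle equations for the enclosing messages j and l to processor P_b; the function name and the parameter `S_gap` (equal to S_a while the empty message is present, and 0 once it is removed) are illustrative choices.

```python
from fractions import Fraction

def enclosed_chunks(H, x, S_gap, S_b, A_b, C_b):
    """Sizes of the two chunks sent to P_b around a gap of length S_gap.

    They solve:  alpha_j * A_b = x + S_gap + S_b + alpha_l * C_b
                 H = 2*S_b + x + S_gap + (alpha_j + alpha_l) * C_b + alpha_l * A_b
    """
    D = Fraction(A_b**2 + A_b * C_b + C_b**2)
    g = x + S_gap
    alpha_j = (H * C_b + S_b * A_b + g * A_b - S_b * C_b) / D
    alpha_l = (H * A_b - 2 * S_b * A_b - g * (A_b + C_b) - S_b * C_b) / D
    return alpha_j, alpha_l

# Arbitrary illustrative values (not taken from the paper).
H, x, S_a, S_b, A_b, C_b = 100, 3, 2, 1, 2, 3
old_j, old_l = enclosed_chunks(H, x, S_a, S_b, A_b, C_b)   # with the empty message
new_j, new_l = enclosed_chunks(H, x, 0, S_b, A_b, C_b)     # after removing it
gain = (new_j + new_l) - (old_j + old_l)
```

On these values the gain equals S_a C_b / (A_b^2 + A_b C_b + C_b^2), the first chunk shrinks, and the second one grows, which matches the argument above.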

4 Complexity of Multi-Round DLS 

4.1 Previously Obtained Complexity Results 

Previous works have studied the complexity of the DLS problem on a heterogeneous star platform with 
affine communication costs. 

A result given by Bharadwaj et al. [7] states that DLS{1Round, $S_i = 0$} can be solved by sorting
processors by increasing $C_i$'s in the activation sequence. The assumption that all $S_i$'s are equal to zero
does not, however, hold in practice. This is why Blazewicz and Drozdowski introduce in [12] DLS problems
in which communication incurs a start-up cost (i.e., $S_i \ne 0$), and they solve them in certain special cases.
Unfortunately, the general DLS problem is difficult due to the need to determine the optimal activation
sequence, which is highly combinatorial. In fact, it is known that solving the DLS problem (as well as
DLS-OptT and DLS-OptW) for a given activation sequence can be done in polynomial time
(e.g., see [4] and Section 3). In [2] it is shown that the difficult activation sequence computation problem
can be bypassed if one assumes that the total load is “sufficiently large”. In this case, all start-up costs
are much smaller than the makespan, and workers should be sorted by increasing $C_i$'s just like when all
$S_i$'s are zero. In spite of these results, it is acknowledged that the complexity of the general DLS problem
above is open [2].

In [19], the authors study the DLS problem with added “buffer constraints”, i.e., for each worker a 
bounded number of load units can be stored on that worker. This limitation essentially provides one 
more condition, which helps when reducing from known NP-complete problems [19]. In [3] this result 
is strengthened by proving that the DLS problem with buffer constraints is NP-complete in the strong 
sense. In this paper we prove the NP-completeness in the weak sense of the general DLS problem without 
additional buffer constraints. 

4.2 One-Round DLS is NP-Hard 

In this section we study one-round schedules, that is the ones in which each worker appears at most 
once in the activation sequence. Without loss of generality, we assume that the bandwidth between the 
master and each worker is infinite, i.e., the time to send x load units to worker i is Si seconds. We now 
consider the following associated decision problem: 

Problem 4 (DLS{1Round, $C_i = 0$}). Given $W$, $p$ workers, $(A_i)_{1 \le i \le p}$, $(S_i)_{1 \le i \le p}$, a rational number
$T \ge 0$, and assuming that bandwidths are infinite, is it possible to compute all $W$ load units within $T$
time units?

We prove that DLS{1Round, $C_i = 0$} is NP-complete. Since DLS{1Round, $C_i = 0$} is a special case
of DLS, we obtain the NP-hardness of the more general DLS. DLS{1Round, $C_i = 0$} is difficult because
the total communication start-up time, $\sum_{1 \le i \le p} S_i$, may be larger than $T$. Therefore, one must use
a carefully chosen subset of the workers, which gives the problem a combinatorial flavor. Intuitively,
for an instance to satisfy DLS{1Round, $C_i = 0$}, it has to meet two requirements: (i) have a makespan
lower than $T$, meaning that the sum of communication start-up costs of the selected workers must be
small enough; and (ii) compute more than $W$ units of load, meaning that the compute speeds of the
selected workers must be large enough. Those two requirements suggest a reduction from the 2-Partition
problem. In the reduction from 2-Partition to DLS{1Round, $C_i = 0$}, we have the following variables at
our disposal, which can be set freely to “force” the selection of workers: $W$, $T$, $(S_i)_{1 \le i \le p}$, and $(A_i)_{1 \le i \le p}$. One
must then carefully choose a small enough $T$ and a large enough $W$.
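
To make the combinatorial flavor concrete, small instances of DLS{1Round, C_i = 0} can be solved by exhaustive search. The sketch below is a brute-force illustration, not an algorithm from the paper: it enumerates every ordered subset of workers, so it is only usable for a handful of workers. It relies on the observation that, with infinite bandwidth, a worker activated after a cumulative start-up time t can be given enough load to compute until time T, contributing (T - t) / A_i load units.

```python
from fractions import Fraction
from itertools import permutations

def max_work_one_round(S, A, T):
    """Largest load computable within T in one round when all C_i are 0."""
    p = len(S)
    best = Fraction(0)
    for r in range(1, p + 1):
        for order in permutations(range(p), r):
            elapsed = Fraction(0)
            work = Fraction(0)
            for i in order:
                elapsed += S[i]          # one-port: start-up costs accumulate
                if elapsed > T:
                    break                # this worker cannot be activated in time
                work += (Fraction(T) - elapsed) / A[i]
            best = max(best, work)
    return best

def dls_one_round(S, A, W, T):
    """Decision version: can W load units be computed within T?"""
    return max_work_one_round(S, A, T) >= W
```

For two workers with start-up costs 1 and 2 and unit computation costs, at most 2 load units can be processed within T = 3 and at most 4 within T = 4.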


Theorem 5. DLS{1Round, $C_i = 0$} is NP-complete.

Proof. We first show that DLS{1Round, $C_i = 0$} is in NP. A solution to the problem consists of an
activation sequence and load chunk sizes. An activation sequence is a string of length at most $p$. For a
given activation sequence, it is known that one can compute the load chunk sizes in polynomial time [4]
(see also Section 5.1). Therefore, given a solution to an instance of DLS{1Round, $C_i = 0$}, one can verify
in polynomial time that the subset of workers completes workload $W$ within time $T$.

We prove that DLS{1Round, $C_i = 0$} is NP-complete via a reduction from the NP-complete 2-Partition
problem [22], which is defined as follows:

Problem 5 (2-Partition). Given a finite set $B$ of integers $b_i$, $1 \le i \le 2m$, is there a subset $B' \subset B$
such that $|B'| = m$ and

\[
\sum_{b_i \in B - B'} b_i = \sum_{b_i \in B'} b_i \ ?
\]

Given an instance of 2-Partition, we construct an instance of DLS{1Round, $C_i = 0$} as follows. For
each $b_i$ we create a worker $P_i$, with start-up cost $S_i$ and computation speed $1/A_i$ such that $S_i = \frac{1}{A_i} = M + b_i$,
for a total of $p = 2m$ workers. We then choose $T = mM + L + \frac{1}{2}$ and $W = \frac{m(m-1)}{2} M^2 + (m-1) L M + \frac{m}{2} M$, where $L = \frac{1}{2} \sum_{i=1}^{2m} b_i$.
We must choose $M$ above as a “large” number; it turns out that choosing M = is sufficient.

Figure 3 depicts a schedule with four workers with time on the horizontal axis (from time 0 to time
$T$) and the four workers on the vertical axis. For each worker we show a communication phase (in white),
followed by a computation phase (in various shades of gray, whose meaning will be explained shortly).
Note that all workers finish computing at the same time. Indeed, since we have a fixed cost for sending
chunks over the network, one can easily send enough load to each worker to keep it busy until time $T$.
Clearly, the number of load units that a worker contributes to the computation is the product of its
computational speed and the duration of the time interval from the end of communication to the
overall application finish time. To make the proof simpler to follow, we let the width of each worker slot
in the figure be $S_i$, so that the worker’s contribution to the overall number of computed load units is
given by the area of its slot. In the proof, we will compute such areas in order to estimate numbers of
computed load units. We prove the reduction with the usual two steps.

Figure 3: Illustration to the proof of Theorem 5.
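
The construction can be tried out on a small 2-Partition instance. The sketch below assumes the reading T = mM + L + 1/2 and W = m(m-1)/2 M^2 + (m-1) L M + (m/2) M, with L equal to half the sum of the b_i; these formulas, the function names, and the value of M used in the example are assumptions made for illustration, since M is left as a free parameter here. Only the direction "a balanced subset yields enough work" is exercised.

```python
from fractions import Fraction

def reduction_instance(b, M):
    """Build (S, A, T, W) from the integers b of a 2-Partition instance."""
    m = len(b) // 2
    L = Fraction(sum(b), 2)                       # target sum of each half
    S = [M + bi for bi in b]                      # start-up costs
    A = [Fraction(1, M + bi) for bi in b]         # speed 1/A_i equals S_i
    T = m * M + L + Fraction(1, 2)
    W = Fraction(m * (m - 1), 2) * M**2 + (m - 1) * L * M + Fraction(m, 2) * M
    return S, A, T, W

def work_done(order, S, A, T):
    """Load computed when the workers in `order` are activated once each.

    Assumes the start-up costs of the selected workers fit within T.
    """
    elapsed, work = 0, 0
    for i in order:
        elapsed += S[i]
        work += (T - elapsed) / A[i]              # area of the worker's slot
    return work

b = [1, 4, 2, 3]                                  # {1, 4} and {2, 3} both sum to 5
S, A, T, W = reduction_instance(b, M=100)
computed = work_done([0, 1], S, A, T)             # workers built from b = 1 and b = 4
```

Here the two selected workers compute slightly more than W; the surplus corresponds to the area that the proof leaves uncounted.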


Lemma 6. 2-Partition $\Rightarrow$ DLS{1Round, $C_i = 0$}

Proof. If we have a solution to the 2-Partition problem, we show that we also have a solution to the
DLS{1Round, $C_i = 0$} problem. Pick all $m$ workers $P_i$ such that $b_i \in B'$. For simpler notation, and
without loss of generality, we assume that $i = 1, \ldots, m$. First, we note that the sum of the communication
start-up costs is

\[
\sum_{i=1}^{m} S_i = \sum_{i=1}^{m} (M + b_i) = mM + L < T.
\]

Consequently, all $m$ workers can participate in the computation and appear in the schedule.

Given the set of workers participating in the schedule, we now estimate the number of load units that
are computed before time $T$, which corresponds to the shaded area shown in Fig. 3. The figure shows the
shaded area partitioned in five different types of rectangular zones. Zone A consists of $m(m-1)/2$ squares
of dimension $M \times M$. Zone B consists of $m - i$ rectangles of dimension $M \times b_i$ for $i = 1, \ldots, m-1$, for a
total of $m(m-1)/2$ such rectangles. Zone C is similar to zone B, but with $i - 1$ rectangles of dimension
$b_i \times M$ for $i = 2, \ldots, m$. Zone D, which visually corresponds to the intersections between rectangles
in zones B and C, consists of $m(m-1)/2$ rectangles of dimensions $b_j \times b_i$, for $i = 1, \ldots, m-1$ and
$j \in [2, \ldots, m]$ with $j > i$, and of $m$ rectangles of dimension $\frac{1}{2} \times b_i$, for $i = 1, \ldots, m$. Finally, zone E
consists of $m$ rectangles of dimension $\frac{1}{2} \times M$.

We compute the sum of the areas of zones A, B, C, and E. The total area of zone A is clearly
$M^2 m(m-1)/2$. The total area of zone B is:

$$M\big((m-1)b_1 + (m-2)b_2 + \dots + b_{m-1}\big).$$

The total area of zone C is:

$$M\big(b_2 + 2b_3 + 3b_4 + \dots + (m-1)b_m\big).$$

Finally, the total area of zone E is $\frac{1}{2}mM$. Because we have not counted the area due to rectangles in zone
D, we can bound below the total number of computed load units, $\mathcal{W}$, as follows:

$$\begin{aligned}
\mathcal{W} &\ge \frac{m}{2}(m-1)M^2 + M\sum_{i=1}^{m-1} i\,b_{i+1} + M\sum_{i=1}^{m-1}(m-i)\,b_i + \frac{1}{2}mM \\
&= \frac{m}{2}(m-1)M^2 + (m-1)M\sum_{i=1}^{m} b_i + \frac{1}{2}mM \\
&= \frac{m}{2}(m-1)M^2 + (m-1)ML + \frac{1}{2}mM = W.
\end{aligned}$$

We conclude that we have a solution to the DLS{1Round, $C_i = 0$} problem. □
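The area argument of Lemma 6 can be checked numerically on a small instance. The sketch below is only an illustration under assumptions read off the surrounding proof: each selected worker has start-up cost $S_i = M + b_i$ and computing speed $S_i$, with $M = 8m^2L^2$ and $T = mM + L + \frac{1}{2}$. The function names and the sample values are hypothetical.

```python
from fractions import Fraction

def computed_load(b_subset, M, T):
    """Load computed before T by workers with S_i = M + b_i and speed S_i,
    activated back to back in the given order (C_i = 0 is assumed)."""
    total, comm_end = Fraction(0), Fraction(0)
    for b in b_subset:
        s = M + b
        comm_end += s                      # end of this worker's start-up
        total += s * max(Fraction(0), T - comm_end)
    return total

def target_load(m, M, L):
    """W = m(m-1)/2 * M^2 + (m-1) * M * L + m * M / 2."""
    return Fraction(m * (m - 1), 2) * M**2 + (m - 1) * M * L + Fraction(m * M, 2)

# Hypothetical 2-Partition instance: six values summing to 2L = 12, so m = 3.
m, L = 3, 6
M = 8 * m**2 * L**2
T = m * M + L + Fraction(1, 2)
subset = [3, 1, 2]                         # one half of the partition, sum = L

surplus = computed_load(subset, M, T) - target_load(m, M, L)
```

On this instance the surplus is exactly the area of zone D that the lower bound leaves out, i.e. the pairwise products $b_i b_j$ plus half the sum of the $b_i$.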

Lemma 7. DLS{1Round, $C_i = 0$} $\Rightarrow$ 2-Partition

Proof. If we have a solution to the DLS{1Round, $C_i = 0$} problem, we now show that we also have a
solution to the 2-Partition problem. First, we know that the solution to DLS{1Round, $C_i = 0$}
cannot use more than $m$ workers. Otherwise, the $S_i$ start-up costs would add up to a time larger than
$T$. We can also see that we need at least $m - 1$ workers. Indeed, a smaller number of workers will not
suffice because the overall number of computed load units will be strictly lower than $W$. (Intuitively, we
waste the opportunity to compute a large number of load units when using only $m - 2$ workers.) Therefore we must
use either $m - 1$ or $m$ workers. However, we now prove that we cannot use $m - 1$ workers.

Let us assume that the solution of DLS{1Round, $C_i = 0$} uses $m - 1$ workers, and let us compute the
total number of load units computed before time $T$. The intuition is that by not having the $m$-th worker
we miss its contribution in zone E, which makes the overall number of computed load units strictly lower
than $W$. We count the area in two parts: the area before worker $m - 1$ finishes communication (left of
time instant $t_1$ in Fig. 3), and the area after that. For the first part, the squares in zone A sum up to
$M^2(m-1)(m-2)/2$. Those in zones B and C sum up to $M(m-2)\sum_{i=1}^{m-1} b_i$, as in the previous section of
the proof. Rectangles in zone D left of $t_1$ take up
$\sum_{1 \le i < j \le m-1} b_i b_j \le \frac{1}{2}(m-1)(m-2)(2L)^2 \le 2m^2L^2$.
The second part is easily computed as the area between $t_1$ and $t_2$.
We can add the two parts and obtain the total number of computed load units, $\mathcal{W}$, as follows:

$$\begin{aligned}
\mathcal{W} &\le \frac{(m-1)(m-2)}{2}M^2 + M(m-2)\sum_{i=1}^{m-1} b_i + 2m^2L^2
 + \Big((m-1)M + \sum_{i=1}^{m-1} b_i\Big)\Big(T - (m-1)M - \sum_{i=1}^{m-1} b_i\Big) \\
&= W + 2m^2L^2 + \Big(L + \frac{1}{2} - \sum_{i=1}^{m-1} b_i\Big)\sum_{i=1}^{m-1} b_i - \frac{1}{2}M \\
&< W,
\end{aligned}$$

where the last inequality holds because $M = 8m^2L^2$, so that $\frac{1}{2}M = 4m^2L^2$, while
$\big(L + \frac{1}{2} - \sum b_i\big)\sum b_i \le \frac{1}{4}\big(L + \frac{1}{2}\big)^2 \le L^2$.

Therefore, we cannot have a solution with $m - 1$ workers, and we must use exactly $m$ workers.

For the solution of DLS{1Round, $C_i = 0$} with $m$ workers, the makespan is at most
$T$ (otherwise it would not be a solution). The makespan is at least the sum of the $S_i$, and $T$
is equal to $mM + L + \frac{1}{2}$. Therefore, we have:

$$\sum_{i=1}^{m} S_i \le mM + L + \frac{1}{2}.$$

Replacing $S_i$ by its value, we obtain:

$$\sum_{i=1}^{m} b_i \le L + \frac{1}{2}.$$

Because the $b_i$ are integers, we have

$$\sum_{i=1}^{m} b_i \le L. \qquad (6)$$


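We now estimate the total number of load units computed. The small area of zone D, denoted by $\mathcal{D}$, is

$$\mathcal{D} = \sum_{1 \le i < j \le m} b_i b_j + \frac{1}{2}\sum_{i=1}^{m} b_i \le \frac{1}{2}m(m-1)L^2 + \frac{1}{2}L,$$

because of Eq. (6). Then the total computed load $\mathcal{W}$ is:

$$\begin{aligned}
\mathcal{W} &= \frac{m}{2}(m-1)M^2 + M(m-1)\sum_{i=1}^{m} b_i + \mathcal{D} + \frac{1}{2}mM \\
&\le \frac{m}{2}(m-1)M^2 + M(m-1)\sum_{i=1}^{m} b_i + \frac{1}{2}mM + \frac{1}{2}m(m-1)L^2 + \frac{1}{2}L.
\end{aligned}$$

Since the schedule is a solution of DLS{1Round, $C_i = 0$}, we also have:

$$\mathcal{W} \ge W = \frac{m}{2}(m-1)M^2 + (m-1)LM + \frac{1}{2}mM.$$

Therefore

$$M(m-1)\sum_{i=1}^{m} b_i + \frac{1}{2}m(m-1)L^2 + \frac{1}{2}L \ge (m-1)LM,$$

which, given that $M = 8m^2L^2$, implies that:

$$\sum_{i=1}^{m} b_i \ge L - \frac{1}{16m} - \frac{1}{2(m-1)8m^2L} > L - 1.$$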


Because the $b_i$ are integers, we have:

$$\sum_{i=1}^{m} b_i \ge L. \qquad (7)$$

Inequalities (6) and (7) show that $\sum_{i=1}^{m} b_i = L$. Therefore, there is a solution to the 2-Partition
problem, which is obtained by picking all the $b_i$ values that correspond to the workers participating in
the computation in the solution for the DLS{1Round, $C_i = 0$} problem. □

Consequently, DLS{1Round, $C_i = 0$} is weakly NP-complete, showing that the one-round DLS problem
is NP-hard. □

Note that in the proof, we never rely on the assumption that the distribution is done in one round.
As a consequence, the decision problem (DLS{$C_i = 0$}) associated with the multi-round DLS problem with
infinite bandwidths is NP-complete too, and the multi-round DLS problem is NP-hard. As a matter of
fact, in the infinite bandwidth setting, there is nothing to gain in making two communications to a given
processor. Hence the optimal solution uses at most one round and DLS{1Round, $C_i = 0$} = DLS{$C_i = 0$}.


4.3 A Pseudo-Polynomial Algorithm 

We analyze two dual optimization problems. DLS{1Round, $C_i = 0$}-OptT: given $W$, find the shortest
schedule, of length $T^*$. DLS{1Round, $C_i = 0$}-OptW: given a schedule length $T$, find the maximum
load $W^*$ that can be processed within this time limit. We will demonstrate that both problems can be solved in
pseudopolynomial time if $(C_i)_{1 \le i \le p} = 0$. The algorithm we propose solves problem DLS{1Round, $C_i = 0$}-OptW.
For the dual problem, the optimum schedule length can then be found by a binary search over values
of $T$. Recall that the worker parameters are integers, which allows $W, T, W^*, T^*$ to be rational numbers.

We first establish several facts. 


Proposition 8. For a given time limit $T$ and set $\mathcal{P}' \subseteq \{P_1, \dots, P_p\}$ of workers taking part in the
computation, the maximum load is processed if workers are ordered according to nondecreasing values of
$S_i A_i$ for $P_i \in \mathcal{P}'$.


Proof. The proof is based on an interchange argument. Consider two workers $P_i$ and $P_j$ that are
consecutive in the activation sequence. Let the communication to the pair start at time $x \le T - S_i - S_j$.
The communications with $P_i$, $P_j$ are performed in the interval $[x, x + S_i + S_j]$. A change in the sequence of
$P_i$, $P_j$ does not influence the schedule for the other workers. The load processed by the two workers in
sequence $(P_i, P_j)$ is:

$$W_1 = \frac{T - x - S_i}{A_i} + \frac{T - x - S_i - S_j}{A_j}.$$

For sequence $(P_j, P_i)$ the load processed by the two workers is

$$W_2 = \frac{T - x - S_j}{A_j} + \frac{T - x - S_j - S_i}{A_i}.$$

From which we get

$$W_1 - W_2 = \frac{S_j}{A_i} - \frac{S_i}{A_j}.$$

Thus, the load is greater for the sequence $(P_i, P_j)$ if $S_i A_i < S_j A_j$. □
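The interchange argument can be replayed on concrete numbers. The following sketch evaluates the load processed by two consecutive workers in both orders and compares the difference with $S_j/A_i - S_i/A_j$. The helper names and the sample workers are hypothetical, and integer $S_i$, $A_i$ with $C_i = 0$ are assumed.

```python
from fractions import Fraction

def pair_load(T, x, first, second):
    """Load processed by two consecutive workers whose communications start
    at time x. Each worker is an (S, A) pair of integers, and C = 0."""
    (s1, a1), (s2, a2) = first, second
    return Fraction(T - x - s1, a1) + Fraction(T - x - s1 - s2, a2)

def activation_order(workers):
    """Order workers by nondecreasing S_i * A_i, as in Proposition 8."""
    return sorted(workers, key=lambda w: w[0] * w[1])

P_i, P_j = (2, 3), (5, 1)            # (S, A): S_i*A_i = 6, S_j*A_j = 5
T, x = 20, 0

W1 = pair_load(T, x, P_i, P_j)       # sequence (P_i, P_j)
W2 = pair_load(T, x, P_j, P_i)       # sequence (P_j, P_i)
```

Here $S_j A_j < S_i A_i$, so activating $P_j$ first processes more load, which is the order that `activation_order` returns.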


Proposition 9. The maximum load $W^*$ that can be processed in time $T$ can be found in
$O(p \min\{\lfloor T \rfloor, \sum_{i=1}^{p} S_i\})$ time if $(C_i)_{1 \le i \le p} = 0$.

Proof. Let us assume that some sequence of worker activation is fixed, and without loss of generality
that it is $P_1, \dots, P_p$. We only have to choose a subset of the workers. The amount of load that can be
processed by $P_i$ in time $T$, provided that it finishes communications at time $\tau \ge S_i$ and that $C_i = 0$,
is $w_i = \max\{0, \frac{T - \tau}{A_i}\}$. $W^*$ can be calculated via function $W(i, \tau)$, which is the maximum load amount
processed by workers selected from set $\{P_1, \dots, P_i\}$ finishing communications at time $\tau$, for $i = 1, \dots, p$
and $\tau = 1, \dots, \min\{\lfloor T \rfloor, \sum_{i=1}^{p} S_i\}$. $W(i, \tau)$ can be calculated in
$O(p \min\{\lfloor T \rfloor, \sum_{i=1}^{p} S_i\})$ time using the following recursive equations:

$$W(i, \tau) = \begin{cases}
W(i-1, \tau) & \text{for } \tau < S_i \\
\max\Big\{ W(i-1, \tau),\ W(i-1, \tau - S_i) + \max\big\{0, \frac{T - \tau}{A_i}\big\} \Big\} & \text{for } \tau \ge S_i
\end{cases}$$

for $i = 1, \dots, p$, with $W(i, 0) = 0$ for $i = 1, \dots, p$, and $W(0, \tau) = 0$ for
$\tau = 1, \dots, \min\{\lfloor T \rfloor, \sum_{i=1}^{p} S_i\}$. The maximum load is
$W^* = \max_{0 \le \tau \le \min\{\lfloor T \rfloor, \sum_{i=1}^{p} S_i\}} W(p, \tau)$. Let $\tau^*$ satisfy
$W^* = W(p, \tau^*)$. The set of workers taking part in the computation can be found by backtracking from
$W(p, \tau^*)$ and selecting those workers $P_i$ for which $W(i, \tau) = W(i-1, \tau - S_i) + \max\{0, \frac{T - \tau}{A_i}\}$. □
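The recursion of Proposition 9 is a knapsack-style dynamic program, and a minimal Python sketch may make the table and the backtracking step concrete. It assumes integer start-up costs $S_i$, $C_i = 0$, and workers given as hypothetical `(S, A)` pairs; the function name `max_load` and the tie-breaking rule in the backtracking are choices of this sketch, not of the paper.

```python
from fractions import Fraction

def max_load(workers, T):
    """Return (W*, chosen worker indices in activation order).

    workers: list of (S_i, A_i) integer pairs; C_i = 0 is assumed.
    table[i][tau] is the best load using the first i workers (in
    nondecreasing S*A order) with communications ending at time tau.
    """
    T = Fraction(T)
    order = sorted(range(len(workers)), key=lambda i: workers[i][0] * workers[i][1])
    horizon = min(int(T), sum(s for s, _ in workers))   # floor(T) for T >= 0
    table = [[Fraction(0)] * (horizon + 1)]
    for idx in order:
        s, a = workers[idx]
        prev = table[-1]
        row = list(prev)                                  # case tau < S_i
        for tau in range(s, horizon + 1):
            gain = max(Fraction(0), (T - tau) / a)
            row[tau] = max(prev[tau], prev[tau - s] + gain)
        table.append(row)
    best_tau = max(range(horizon + 1), key=lambda t: table[-1][t])
    # Backtrack: a worker was selected when using it strictly improved the entry.
    chosen, tau = [], best_tau
    for i in range(len(order), 0, -1):
        if table[i][tau] != table[i - 1][tau]:
            chosen.append(order[i - 1])
            tau -= workers[order[i - 1]][0]
    return table[-1][best_tau], chosen[::-1]

w_star, selected = max_load([(2, 3), (5, 1)], 20)
```

With the two sample workers and $T = 20$, both workers are used, the second one (smaller $S_i A_i$) being activated first.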

Theorem 10. The minimum schedule length $T^*$ for a given load $W$ can be found in
$O\big((\log W + \log p + \log A_{max} + \log S_{max})\, p \min\{\lfloor \max_i S_i + W A_{max} \rfloor, \sum_{i=1}^{p} S_i\}\big)$
time if $(C_i)_{1 \le i \le p} = 0$.

Proof. Let $A_{min} = \min_i\{A_i\}$, $A_{max} = \max_i\{A_i\}$. In the optimum sequence workers are activated
according to the nondecreasing order of $S_i A_i$ by Proposition 8. For a given schedule length $T$, the maximum
problem size $W^*$ can be found in $O(p \min\{\lfloor T \rfloor, \sum_{i=1}^{p} S_i\})$ time according to Proposition 9. The
minimum schedule length can be found by a binary search over the values of $T$. It remains to show that the
number of calls to the pseudopolynomial time algorithm is limited.

Let $x_i = 1$ if $P_i$ takes part in the computation, and $x_i = 0$ otherwise. Thus, vector $x = [x_1, \dots, x_p]$
represents a subset of $\{P_1, \dots, P_p\}$, the workers that take part in the computation. The load amount
that can be processed in time $T$ is

$$W = \sum_{i=1}^{p} \frac{T x_i}{A_i} - \sum_{i=1}^{p} \sum_{j=i}^{p} \frac{x_i x_j S_i}{A_j}. \qquad (8)$$

$\frac{T x_i}{A_i}$ is the amount of load that could have been processed provided that there were no communication
delays. $x_i S_i$ is the computation time lost due to the activation of $P_i$. This loss of computing time affects
workers $P_j$ for $j \ge i$ because the activation sequence is fixed. Hence, $\sum_{j=i}^{p} \frac{x_i x_j S_i}{A_j}$ is the amount
of load lost due to the communication delays. It follows from equation (8) that $W$ is a piecewise-linear
nondecreasing function of $T$ as required by Theorem 1. Therefore, the optimum $T^*$ for a given $W$ is a
point on one segment or on an intersection of two segments of this piecewise-linear function. Let $x$ and
$x'$ represent two different subsets of workers for which $W$ is maximum at two different schedule lengths.
The two linear functions of load amounts that can be processed in time $T$ by workers corresponding to
$x$ and $x'$ intersect at

$$T(x, x') = \frac{\sum_{i=1}^{p} \sum_{j=i}^{p} (x_i x_j - x'_i x'_j) \frac{S_i}{A_j}}{\sum_{i=1}^{p} \frac{x_i - x'_i}{A_i}}. \qquad (9)$$

Thus, the minimum distance between two different intersections is

$$\Delta = \frac{1}{A_{max} \sum_{i=1}^{p} \frac{1}{A_i}} \ge \frac{A_{min}}{p A_{max}} \ge \frac{1}{p A_{max}}.$$

If the difference between two values $T_1$, $T_2$ visited in the binary search is smaller than $\Delta$, then either the
same subset of workers gives maximum load for $T_1$ and $T_2$, or two different subsets of workers are selected
for $T_1$, $T_2$. In the first case $T^*$ can be found using linear interpolation of function (8). In the second case
there is one more intersection $T_3$ between $T_1$ and $T_2$, which can be found using (9), and then $T^*$ can be found
using linear interpolation either to the left or to the right of $T_3$. Since no schedule is longer than $S_{max} +
W A_{max}$ and the resolution is $\Delta$, the binary search for $T^*$ over $T$ values can be terminated in
$O(\log((S_{max} + W A_{max}) A_{max}\, p)) = O(\log W + \log p + \log A_{max} + \log S_{max})$ steps. The complexity of the whole algorithm
is not greater than $O\big((\log W + \log p + \log A_{max} + \log S_{max})\, p \min\{\lfloor S_{max} + W A_{max} \rfloor, \sum_{i=1}^{p} S_i\}\big)$. □
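The binary search of Theorem 10 can be sketched as follows. This is a simplified illustration, not the exact procedure of the proof: instead of stopping at resolution $\Delta$ and finishing with linear interpolation, it bisects down to a tolerance `eps`, which is an assumption of this sketch. Workers are hypothetical `(S, A)` integer pairs with $C_i = 0$, and the inner routine is a compact version of the dynamic program of Proposition 9.

```python
from fractions import Fraction

def max_load(workers, T):
    """Maximum load processed within T (dynamic program of Proposition 9)."""
    T = Fraction(T)
    ordered = sorted(workers, key=lambda w: w[0] * w[1])
    horizon = min(int(T), sum(s for s, _ in ordered))
    row = [Fraction(0)] * (horizon + 1)
    for s, a in ordered:
        new = list(row)
        for tau in range(s, horizon + 1):
            # tau <= floor(T), so (T - tau) / a is never negative here.
            new[tau] = max(row[tau], row[tau - s] + (T - tau) / a)
        row = new
    return max(row)

def min_makespan(workers, W, eps=Fraction(1, 10**6)):
    """Bisection on T; returns a feasible T within eps of the optimum."""
    W = Fraction(W)
    lo = Fraction(0)
    # No schedule is longer than S_max + W * A_max (one worker does everything).
    hi = max(s for s, _ in workers) + W * max(a for _, a in workers)
    while hi - lo > eps:
        mid = (lo + hi) / 2
        if max_load(workers, mid) >= W:
            hi = mid            # mid is long enough: search to the left
        else:
            lo = mid            # mid is too short: search to the right
    return hi

t_single = min_makespan([(2, 3)], 6)
t_pair = min_makespan([(2, 3), (5, 1)], Fraction(58, 3))
```

The bisection relies only on the fact that the maximum load is nondecreasing in $T$. Exact rational arithmetic is used so that the feasibility test is not affected by rounding.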


4.4 Complexity considerations 

It follows from Section 4.2 that DLS is NP-hard, because its special case DLS{1Round, $C_i = 0$} is NP-complete
(see Figure 4). However, proving that DLS belongs to NP is quite difficult. Indeed, as we need
to know for each activation the amount of data that is sent to each host, the complexity of a reasonable
DLS certificate, i.e., the length of the string encoding a solution, will be at least $\Omega(act_{max})$ (one needs
at least to encode the activation sequence).

Figure 4: Complexity hierarchy. Boxed problems belong to NP and arrows depict the NP-harder relation.

Theorem 11. The optimal number of activations $act_{max}$ for DLS-OptT($W$) is $\Omega(\sqrt{W})$ for some instances.


Proof. We consider the simple instance with $p = 1$, $(S_1, C_1, A_1) = (1, 1, 1)$. The only possible activation
sequences are thus $(1), (1,1), (1,1,1), \dots$. Let us denote by $T_n(W)$ the time needed to process $W$ units
of load when using $n$ activations without idle time. We can easily prove that

$$T_n(W) = \frac{n+1}{2} + \frac{n+1}{n} W.$$

Thus, we have $T^*(W) = \min_n T_n(W)$ and we know that $T^*$ is a continuous piecewise-linear function.
Therefore, slope modifications of $T^*$ occur for $T_n(W) = T_{n+1}(W)$, i.e., for $W = \frac{n(n+1)}{2}$. Therefore the
optimal number of activations is $\Omega(\sqrt{W})$. □
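The growth of the optimal number of activations can be observed directly on the single-worker instance of the proof. The sketch below assumes the closed form $T_n(W) = \frac{n+1}{2} + \frac{n+1}{n}W$, which follows from the no-idle conditions $\alpha_{k-1} = 1 + \alpha_k$ when $S = C = A = 1$. The function names are hypothetical.

```python
from fractions import Fraction

def makespan(n, W):
    """T_n(W) for the instance p = 1, (S, C, A) = (1, 1, 1)."""
    return Fraction(n + 1, 2) + (n + 1) * Fraction(W) / n

def chunk_sizes(n, W):
    """Chunks alpha_1 > ... > alpha_n with alpha_{k-1} = 1 + alpha_k, summing to W."""
    last = Fraction(W) / n - Fraction(n - 1, 2)
    return [last + (n - k) for k in range(1, n + 1)]

def best_activations(W):
    """Smallest n minimizing T_n(W); T_n(W) is convex in n, so a scan suffices."""
    n = 1
    while makespan(n + 1, W) < makespan(n, W):
        n += 1
    return n

n_opt = best_activations(100)
```

For $W = 100$ the scan stops at $n = 14$, in line with the breakpoints $W = n(n+1)/2$: 14 activations are optimal for loads between 91 and 105.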

The previous theorem means that there is no polynomial bound on $act_{max}$. Intuitively, the optimal
number of activations may grow roughly linearly with $\sqrt{W}$, and therefore exponentially in the problem size,
because $W$ is encoded with $O(\log W)$ bits. If $act_{max}$ were bounded by a polynomial function $q$ of $W$, the
$A_i$'s, $C_i$'s and $S_i$'s, then the complexity of our certificate would be $O(q(W, A_i, C_i, S_i))$. However, for
problem DLS to be in NP the certificate length should be $O(r(\log W, \log A_i, \log C_i, \log S_i))$, where
$q$, $r$ are polynomials. In other words, the number of activations needed to reach the optimum may be far
too large. This does not prove that DLS does not belong to NP, because we could use another
kind of certificate. But it also seems hard to have a certificate smaller than the activation sequence, as
the fraction of load sent at each activation seems mandatory.

As an alternative, we let the input include a bound on the number of activations. Such a hypothesis 
makes sense in practice as an arbitrarily complex schedule may not be desirable. We define problem 
DLS{Bounded} where we enforce that the maximum number of activations is bounded by the log of a 
bound K. 

Problem 6 (DLS{Bounded}). Given $W$, $p$ workers, $(A_i)_{1 \le i \le p}$, $(S_i)_{1 \le i \le p}$, $(C_i)_{1 \le i \le p}$, a rational number
$T \ge 0$, and an integer $K$, is it possible to compute all $W$ load units with at most $\log(K)$ activations
within $T$ time units after the master starts sending out the first load unit?

DLS{Bounded} clearly belongs to NP. DLS{1Round, $C_i = 0$} being a particular instance of DLS{Bounded},
DLS{Bounded} is NP-complete as well. It should however be noted that, unlike many other problems
(e.g., finding broadcast trees optimizing network throughput [5]), NP-hardness does not come from the 
bound on the number of activations, but from the resource selection problem. This bound on the number 
of activations is mainly an artifact of defining our problem in NP. 


5 Exact algorithms

5.1 Mixed Integer Linear Programming 

Let us assume that we set a bound $act_{max}$ on the number of activations that can be performed. Let $x_i^{(j)}$
be a binary value indicating whether worker $i$ is used at activation $j$. Let us recall that $\alpha_i^{(j)}$ denotes
the size of the chunk of load sent to worker $i$ at activation $j$. Then our problem can be formulated as a mixed
optimization program:

Minimize $T$, under the constraints

$$\begin{aligned}
&(10a)\quad \sum_{j=1}^{act_{max}} \sum_{i=1}^{p} x_i^{(j)} \alpha_i^{(j)} = W \\
&(10b)\quad \forall k \le act_{max},\ \forall l:\quad \sum_{j=1}^{k} \sum_{i=1}^{p} x_i^{(j)} \big(S_i + \alpha_i^{(j)} C_i\big) + \sum_{j=k}^{act_{max}} x_l^{(j)} \alpha_l^{(j)} A_l \le T \\
&(10c)\quad \forall k \le act_{max}:\quad \sum_{i=1}^{p} x_i^{(k)} \le 1 \\
&(10d)\quad \forall i, j:\quad x_i^{(j)} \in \{0, 1\} \\
&(10e)\quad \forall i, j:\quad \alpha_i^{(j)} \ge 0
\end{aligned} \qquad (10)$$

Constraint (10a) enforces the fact that the whole workload $W$ is processed by our workers.
Constraint (10c) imposes that at most one worker is used in each activation. The leftmost part of
Constraint (10b) represents the time at which the communication ends, and the middle one is a lower
bound on the computation time of worker $l$ after the activation. The sum of these two times has
thus to be smaller than the makespan $T$. Considering in backward order the activations where a given
worker $l$ is used, it is not hard to see from the constraints that one will obtain a feasible schedule [16].

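As such, this program is not linear. However, it can easily be transformed into an equivalent Mixed
Integer Linear Program (MILP) by introducing a new variable $\beta_i^{(j)} = x_i^{(j)} \alpha_i^{(j)}$ as follows:

Minimize $T$, under the constraints

$$\begin{aligned}
&(11a)\quad \sum_{j=1}^{act_{max}} \sum_{i=1}^{p} \beta_i^{(j)} = W \\
&(11b)\quad \forall k \le act_{max},\ \forall l:\quad \sum_{j=1}^{k} \sum_{i=1}^{p} \big(x_i^{(j)} S_i + \beta_i^{(j)} C_i\big) + \sum_{j=k}^{act_{max}} \beta_l^{(j)} A_l \le T \\
&(11c)\quad \forall k \le act_{max}:\quad \sum_{i=1}^{p} x_i^{(k)} \le 1 \\
&(11d)\quad \forall i, j:\quad x_i^{(j)} \in \{0, 1\} \\
&(11e)\quad \forall i, j:\quad 0 \le \beta_i^{(j)} \le x_i^{(j)} W
\end{aligned} \qquad (11)$$

This MILP can be seen as a polynomial certificate to the DLS{Bounded} problem where at most $act_{max}$
activations are allowed. Such a program can easily be solved by using a branch and bound technique
since the linear relaxation of the $x_i^{(j)}$'s provides a lower bound on the solution of the original problem.
In the rest of this article, we call BB the branch-and-bound algorithm that solves this MILP.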


5.2 Lighter Branch and Bound 

In this section we present a branch and bound algorithm, which we call BB-Light. It is based on the 
optimality principle decreeing that there is idle time neither in communications, nor in computations, 
as proved in Section 3. A branch and bound algorithm consists of two components: a branching scheme 
that guides the exhaustive search of the solution space, and a bounding method that prunes the search 
space. 

The branching scheme divides the search space into subsets which are either eliminated as certainly 
not providing an optimum solution, or divided into subsets to be further examined. The process of 
search space examination can be viewed as the construction of a tree. Each tree node represents a subset of
solutions. In our case the search space consists of the sequences of communications to the workers. The
algorithm starts with an empty communication sequence. The sequences are constructed by appending
a new worker to some already existing sequence. For example, some sequence $(P_a, \dots, P_z)$ of length $l$ is
expanded by appending all possible workers at position $l + 1$. Thus, the node representing sequence
$(P_a, \dots, P_z)$ is branched into sequences $(P_a, \dots, P_z, P_1), \dots, (P_a, \dots, P_z, P_p)$. Note that each sequence
is a potential solution, for which the distribution of the load and the schedule length have to be calculated.
These are obtained from the linear system (1a), (1b), under the assumptions that these constraints are
satisfied with equality, and that all $\alpha_j$'s are positive ((1c) is satisfied). Under these assumptions the loads $\alpha_j$
may be calculated as a solution of a system of linear equations (1a), (1b) rather than by using linear
programming. An infeasible sequence is recognized if some $\alpha_j < 0$ (in other words, (1c) is not satisfied).
The tree is searched in depth-first order. An upper limit $act_{max}$ is imposed on the depth of the tree.

The bounding scheme eliminates solution subsets, i.e., tree nodes, for which there is no chance of
obtaining a better solution than the best one already known. The quality of the node is evaluated by
calculating a lower bound on the length of the schedules constructed with its successors. Here
a sequence $(P_a, \dots, P_z)$ represents all the sequences starting with the sequence $(P_a, \dots, P_z)$. The lower
bound is calculated as follows. A minimum workload $W_1$ needed to keep processors in the sequence
$(P_a, \dots, P_z)$ working is calculated using equations (1a), (1b) with the additional assumption that $\alpha_l = 0$.
From the same linear system, the schedule length $T_1$ can be found for load $W_1$. The remaining load $W - W_1$
must be sent to the workers in time at least $T_2 = \min_{1 \le i \le p}\{C_i\}(W - W_1)$. In this time workers may
compute at most $W_2 = T_2 \sum_{i=1}^{p} \frac{1}{A_i}$ units. The remaining load must be processed in time at least
$T_3 = \max\big\{0, \frac{W - W_1 - W_2}{\sum_{i=1}^{p} 1/A_i}\big\}$. Hence, the lower bound is $T_1 + T_2 + T_3$.

Note that the lighter branch and bound proposed here still has exponential worst case running time, 
but it is not using linear programming. 
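The bounding step can be written as a short function once $T_1$ and $W_1$ are known for the current prefix. The sketch below takes these two values as inputs, since they come from the linear system (1a), (1b), which is not reproduced here. The function name is hypothetical, and clamping the remaining load at zero is an assumption of this sketch for prefixes that already cover the whole load.

```python
from fractions import Fraction

def lower_bound(T1, W1, W, C, A):
    """Lower bound T1 + T2 + T3 on any schedule extending the current prefix.

    T1, W1: schedule length and minimum load for the prefix (assumed given).
    C, A:   per-worker communication and computation times per load unit.
    """
    remaining = max(Fraction(0), Fraction(W) - W1)      # load still to be sent
    rate = sum(1 / Fraction(a) for a in A)              # aggregate computing speed
    T2 = Fraction(min(C)) * remaining                   # fastest possible transfer
    W2 = T2 * rate                                      # load computable meanwhile
    T3 = max(Fraction(0), (remaining - W2) / rate)      # time for what is left
    return T1 + T2 + T3

bound_fast_link = lower_bound(5, 10, 30, [Fraction(1, 10), 1], [2, 2])
bound_slow_link = lower_bound(5, 10, 30, [1, 2], [1, 1])
```

A node can be pruned as soon as this bound is not smaller than the length of the best schedule found so far.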


6 Heuristics 

In this section we present several scheduling heuristics to solve the general DLS problem. For each 
heuristic we quantify the trade-off between the time to compute the schedule and the quality of the 
schedule. We explore a range of heuristics, from simplistic and fast to sophisticated and potentially time 
consuming. 

Most heuristics in this section attempt to determine a good activation sequence, and then compute the 
best chunk sizes. Therefore, these heuristics solve (10) for one or many activation sequences. Although 
(10) may be solved faster in some cases by taking advantage of peculiarities of the activation sequence,
we use a generic linear program solver. Indeed, solving a linear program is very fast (a few milliseconds
on a standard CPU) and thus does not lead to prohibitively long schedule computation times. Once the
chunk sizes are computed, and given an activation sequence, the makespan can be easily computed.

6.1 Simple Heuristics 

As seen in Section 4.1, DLS{1Round, $S_i = 0$} can be solved by sorting processors by increasing $C_i$'s in
the activation sequence. Likewise, if one assumes that the total load is “sufficiently large”, workers should
be sorted by increasing $C_i$'s just like when all $S_i$'s are zero. Trying to cyclically use all processors sorted by
communication time is thus a natural heuristic. More formally, if we assume that $C_1 \le C_2 \le \dots \le C_p$, we
compute the optimal makespan $T_k$ associated with the activation sequence $\gamma_k$, which consists of $k$ repetitions
of the $p$ sorted processors:

$$\gamma_k = \underbrace{(1, \dots, p, 1, \dots, p, \dots, 1, \dots, p)}_{k \text{ times } (1, \dots, p)}.$$

Note that if a processor receives 0 units of load in a round (as computed when solving (10)), we remove
it from the activation sequence and its latency is therefore not counted in the makespan. We stop as
soon as adding a new round does not lead to any improvement (i.e., as soon as $T_{k+1} \ge T_k$). We call this
heuristic Communication-First. For the sake of completeness, we also developed Computation-First and
Latency-First in which processors are sorted by increasing $A_i$'s and increasing $S_i$'s, respectively.
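The outer loop of Communication-First can be sketched as follows. The evaluation of a fixed activation sequence is delegated to a caller-supplied function, here named `solve_makespan`, which is a hypothetical stand-in for the linear program derived from (10). The cap `max_rounds` is also an assumption of this sketch, as is the omission of the step that removes processors receiving no load.

```python
def cyclic_sequence(C, k):
    """gamma_k: k repetitions of the workers sorted by increasing C_i (0-based)."""
    order = sorted(range(len(C)), key=lambda i: C[i])
    return order * k

def communication_first(C, solve_makespan, max_rounds=50):
    """Add rounds while the makespan strictly improves; return (sequence, makespan).

    solve_makespan(sequence) must return the optimal makespan for that fixed
    activation sequence (for instance by solving a linear program).
    """
    best_seq = cyclic_sequence(C, 1)
    best_T = solve_makespan(best_seq)
    for k in range(2, max_rounds + 1):
        seq = cyclic_sequence(C, k)
        T = solve_makespan(seq)
        if T >= best_T:          # stop as soon as T_{k+1} >= T_k
            break
        best_seq, best_T = seq, T
    return best_seq, best_T

# Toy evaluator used only to exercise the loop: it is best at 6 activations.
toy = lambda seq: abs(len(seq) - 6) + 10
sequence, length = communication_first([3, 1, 2], toy)
```

Computation-First and Latency-First would follow the same loop, with the sort key replaced by $A_i$ or $S_i$.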

6.2 Genetic Algorithm 

Genetic algorithms are randomized search methods that apply genetic operators on a pool of solutions 
with the goal of discovering the optimum. Genetic algorithms are widely used in solving discrete opti¬ 
mization problems, and details on various implementations and applications can be found, e.g., in [23, 26]. 
Here we only outline details of our implementation. Results of preliminary application of genetic algo¬ 
rithms on the DLS problem can be found in [17]. 

A solution is an activation sequence, which is encoded as a string of workers. The string has a
predetermined length $act_{max}$. The quality of the solution is determined as the schedule length $T$ given by the
linear program derived from (10) for a fixed communication sequence. A pool of $G$ random solutions is
generated, on which the genetic operators of crossover, mutation, and selection are iteratively applied.
Single point crossover is applied to generate $G p_C$ new solutions. Mutation changes $G \cdot act_{max} \cdot p_M$
randomly selected destinations in the whole population. Selection chooses for the new population the
best half of the old population, and for the second half of the population newly generated solutions are
chosen using the roulette wheel strategy. The above genetic operators are iteratively applied to construct
new populations until an upper limit $U_T$ on the number of iterations is reached. The algorithm also stops
after reaching an upper limit $U_O$ on the number of iterations without improving the quality of the best solution.

Parameters $G, p_C, p_M, U_T, U_O$ were selected in the following way [17]. A set of 100 random instances
was generated and solved by the genetic algorithm. The measure of quality of tuning was the sum of the
schedule lengths for all the instances, and the rate of its convergence with the iteration number. Note
that this criterion is equivalent to the average relative distance from the optimum, but the actual optima
need not be known. $G = 50$ was selected first, then $p_C = 0.8$, $p_M = 0.03$, and for these fixed parameters
$U_T = 100$, $U_O = 10$ were finally chosen. We call the resulting algorithm GA.
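To make the encoding concrete, the two variation operators can be sketched on activation sequences stored as lists of worker indices. This is only an illustration of single point crossover and of the mutation count $G \cdot act_{max} \cdot p_M$. The function names, the 0-based worker indices, and the use of a seeded `random.Random` instance are assumptions of this sketch, and the evaluation and selection steps are left out.

```python
import random

def single_point_crossover(parent_a, parent_b, rng):
    """Swap the tails of two equal-length activation sequences at a random cut."""
    cut = rng.randrange(1, len(parent_a))
    return parent_a[:cut] + parent_b[cut:], parent_b[:cut] + parent_a[cut:]

def mutate(population, p, p_m, rng):
    """Redirect round(G * act_max * p_m) randomly chosen activations in place."""
    act_max = len(population[0])
    count = round(len(population) * act_max * p_m)
    for _ in range(count):
        individual = rng.randrange(len(population))
        position = rng.randrange(act_max)
        population[individual][position] = rng.randrange(p)  # new destination
    return count

rng = random.Random(0)
p, act_max, G = 4, 10, 50
population = [[rng.randrange(p) for _ in range(act_max)] for _ in range(G)]
child_1, child_2 = single_point_crossover([0] * 5, [1] * 5, rng)
changed = mutate(population, p, 0.03, rng)
```

With $G = 50$, $act_{max} = 10$ and $p_M = 0.03$, each mutation pass rewrites 15 positions of the population.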


6.3 Uniform Multi-Round (UMR) 

In this section we briefly describe the UMR (Uniform Multi-Round) algorithm presented in [32], which has 
been specifically designed to schedule divisible loads. The idea behind UMR is simple: assign chunks of 
“uniform” sizes to all workers within each round, increasing the chunk size between rounds geometrically. 
Here “uniform” means that it takes the same amount of time for each worker to compute its chunk at
each round (i.e., the product $\alpha_i^{(j)} A_i$ depends only on the round $j$ and not on $i$).

Figure 5: Multi-round schedule with UMR.

For illustration purposes, a UMR schedule is depicted in Figure 5 for a heterogeneous platform. Time 
is shown on the x-axis, and workers are shown on the y-axis. The computation start-up costs Si are 
shown in dark grey boxes. We can see that the computation times of chunks are identical across all 
workers within each round (chunk dispatching for round j in the figure goes from time Ta to time Tg). 

In order to maximize network utilization, UMR imposes that the last worker receives data for round
$j + 1$ exactly when it finishes its computation for round $j$, as seen in the figure. Such a condition is
imposed for all workers in the Multi-Installment algorithm [8], which is unfortunately only applicable to
homogeneous platforms. In this sense, UMR can be seen as a relaxation of Multi-Installment so that it
can be applied to heterogeneous platforms. The above condition can be written as:

$$\alpha_p^{(j)} A_p = \sum_{k=1}^{p} \big(S_k + \alpha_k^{(j+1)} C_k\big).$$

Figure 6: Sketch of a periodic multi-round schedule using the first $n$ workers $P_1$ to $P_n$, where $n \le p$.

The above equation, along with the facts that $\alpha_i^{(j)} A_i$ does not depend on $i$ and that
chunk sizes sum up to the entire load, makes it possible to compute the chunk sizes recursively (they
increase geometrically at each round). We refer the reader to [31] for all details. Note that in the last
round UMR decreases chunk sizes within the round so that all workers finish computing at the same
time.

As seen earlier, a difficult issue is that of resource selection. UMR uses a simple heuristic that is 
inspired by the work in [1]: workers with faster networks (i.e., higher bandwidths) are selected first until 
no more worker can be used effectively. 


6.4 Periodic 


In this section, we briefly present the periodic asymptotically optimal algorithm from Beaumont et al. [2]. 
An algorithm is asymptotically optimal if the ratio of the time to execute a workload $W$ over the optimal
time to execute this workload tends to 1 as $W$ tends to infinity. The sketch of the algorithm is as follows.
The overall processing time $T$ is divided into $k$ regular periods of duration $T_p$ (see Figure 6). Let us
consider the following linear program:

Maximize $\rho = \sum_{i=1}^{p} \beta_i$, subject to

$$\begin{cases}
\forall\, 1 \le i \le p, \quad 0 \le \beta_i A_i \le 1 \\
\sum_{i=1}^{p} \beta_i C_i \le 1
\end{cases}$$


This linear program provides an upper bound p on the throughput that any schedule can reach (latency, 
start-up and close-up of the schedule are ignored). Resource selection is automatically performed during 
the resolution of this linear program. The periodic schedule is built such that W/k units of load are 
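processed each period. Using the previously computed values for the $\beta_i$'s and $\rho$, we define
$\alpha_i = \frac{\beta_i}{\rho} \frac{W}{k}$ as the load sent to worker $P_i$ in each period.
Hence we have:

$$T_p = \max\Big\{ \sum_{i=1}^{p} \Big( \frac{\beta_i}{\rho} \frac{W}{k} C_i + S_i \Big),\ \max_i \frac{\beta_i}{\rho} \frac{W}{k} A_i \Big\}
= \max\Big\{ \frac{a}{k} + b,\ \frac{a'}{k} \Big\},$$

where $a = \frac{W}{\rho} \sum_i \beta_i C_i$, $b = \sum_i S_i$, and $a' = \frac{W}{\rho} \max_i \beta_i A_i$.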


Therefore: 
• If $k > (a' - a)/b$ then $a/k + b > a'/k$ and the makespan is:

$T = (k + 1)T_p = a + b + a/k + bk$, which is minimized for $k = \sqrt{a/b}$.

• If $k \le (a' - a)/b$ then $a/k + b \le a'/k$ and the makespan is:

$T = (k + 1)T_p = a' + a'/k$, which is minimized for $k$ as large as possible.

The “optimal” number of rounds for such a periodic schedule is therefore $k = \max\{\sqrt{a/b}, (a' - a)/b\}$. We
call this algorithm Periodic.
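Both steps of Periodic can be sketched in a few lines. The linear program has the structure of a fractional knapsack, so the sketch below solves it greedily by serving workers in order of increasing $C_i$. This greedy shortcut, the function names, and the sample values are assumptions of this illustration rather than the procedure of [2]. The round count follows from minimizing $a/k + bk$, whose minimum is at $k = \sqrt{a/b}$, and the value returned is real-valued (rounding to an integer is left out).

```python
from fractions import Fraction
from math import sqrt

def throughput(A, C):
    """Greedy solution of: max sum(beta) s.t. beta_i*A_i <= 1, sum(beta_i*C_i) <= 1."""
    beta = [Fraction(0)] * len(A)
    budget = Fraction(1)                       # remaining share of the master's link
    for i in sorted(range(len(A)), key=lambda i: C[i]):
        cap = 1 / Fraction(A[i])               # worker i cannot compute faster
        beta[i] = cap if C[i] == 0 else min(cap, budget / Fraction(C[i]))
        budget -= beta[i] * C[i]
    return beta, sum(beta)

def period_makespan(k, a, b, a_prime):
    """T = (k + 1) * T_p with T_p = max(a/k + b, a'/k)."""
    return (k + 1) * max(a / k + b, a_prime / k)

def best_round_count(a, b, a_prime):
    """k = max(sqrt(a/b), (a' - a)/b), as a real number."""
    return max(sqrt(a / b), (a_prime - a) / b)

beta, rho = throughput([1, 2], [Fraction(1, 2), 1])
k_star = best_round_count(16, 1, 10)
```

For $a = 16$, $b = 1$, $a' = 10$ this gives $k = 4$ and a makespan of 25, which is lower than the makespan obtained with 3 or 5 rounds.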

One drawback of the schedule computed by Periodic is that it is very rigid: the exact same amount 
of workload is sent to each worker during every period. Intuitively, rounds should be smaller in the 
beginning to allow a better overlap of communications and computations and a better start-up time 
(like with the heuristics described in Section 6.3). This is why we also propose a variant, Periodic-Optimized,
that uses the exact same activation sequence, but computes the optimal values of the $\alpha_i$'s for
this activation sequence by solving (10). The resulting schedule is likely to be more complex but also
more efficient. 


7 Conclusion 

In this article, we have defined the multi-round divisible load scheduling problem and studied its
complexity. We have stated and proved an optimality principle: for an optimal activation sequence and
the corresponding optimal load distribution in DLS-OptW and DLS-OptT, all messages, except maybe
the trailing ones, convey some load and there is no idle time. We have proved that this problem is
NP-complete even for simple instances (infinite bandwidth) and we have proposed a pseudo-polynomial
algorithm for these instances. We have discussed membership in NP and shown that the optimal
number of activations is not polynomial in the inputs of DLS. We have also presented a wide survey of
previously known or original techniques to solve this problem.

The complexity of a few problems remains open:

• Computing a DL schedule entails three steps: (i) select which workers should participate in the
computation; (ii) decide in which order workers should receive load chunks and how many times; and
(iii) compute how much work each load chunk should comprise. In essence, the NP-completeness
result is based on the selection problem. We have seen that computing the chunk sizes is easy.
It would thus be interesting to know whether the problem is hard once the selection is done, i.e., whether
the ordering problem is hard.

• We have seen that the membership of the general DLS problem in NP is unclear, in particular because
the optimal number of activations is not polynomial in the inputs of DLS. In practice, it is of course
important to have a description of the schedule. Unless some particular structure of the optimal
activation sequence is shown, it is very unlikely that DLS belongs to NP. However, we do not yet know
how to prove that a problem does not belong to NP.

• DLS{1Round} clearly belongs to NP but we have only been able to prove its weak NP-completeness.
The question whether it is strongly NP-complete or not remains open.